Nonparametric Bayes Classification via 
Learning of Affine Subspaces 



Abhishek Bhattacharya 
Indian Statistical Institute 

based on the paper "Density Estimation and Classification
via Bayesian Nonparametric Learning of Affine Subspaces",
jointly with David Dunson & Garritt Page, 2012



January 10, 2013 







Motivation & Goal
Framework
Model
Prior Choice
Weak Posterior Consistency
Strong Posterior Consistency
Principal Subspace Classifier
Estimating the Principal Subspace
Identifiability of the Principal Subspace
Illustrations With Real Data Sets
Brain Computer Interface Data
Wisconsin Breast Cancer data set
Summary
Further Work possible
References



Motivation & Goal

What are we interested in?

■ Build efficient nonparametric Bayes classifiers in the presence
of many predictors.

■ Allow the different cell probabilities to vary
nonparametrically based on a few coordinates expressed
as linear combinations of the predictors.

■ Keep the model parameters clearly interpretable, so that they
provide insight into which predictors are important in
constructing accurate classification boundaries.

■ Show that the estimated cell probabilities are consistent in
both the weak and the strong sense.

■ Data applications support the results.




Affine Subspace Characterization

■ Let $S$ be an affine subspace of $\mathbb{R}^m$ of dimension $k$ ($k \le m$).

■ Let $\theta \in \mathbb{R}^m$ be the projection of the origin into $S$ and
$R \in \mathbb{R}^{m \times m}$ the projection matrix of the linear subspace parallel to $S$.

■ Hence $R = R' = R^2$, $\mathrm{rank}(R) = k$, $R\theta = 0$.

■ Let $R = UU'$, $U \in V_{k,m} = \{U \in \mathbb{R}^{m \times k} : U'U = I_k\}$, the
Stiefel manifold.

■ Any $x \in S$ can be given isometric coordinates $\tilde{x} = U'x \in \mathbb{R}^k$
s.t. $x = U\tilde{x} + \theta$.

■ For $x \in \mathbb{R}^m$, its projection $P_S(x) = Rx + \theta$ has coordinates
$U'x \in \mathbb{R}^k$.

■ The residual $R_S(x) = x - P_S(x)$ lies in a linear subspace $S^\perp$
perpendicular to $S$ with projection matrix $I - R = VV'$,
$V \in V_{m-k,m}$, $V'U = 0$.

■ It has coordinates $V'(x - \theta)$ in $\mathbb{R}^{m-k}$.
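
The characterization above can be checked numerically. The following is a minimal NumPy sketch, not taken from the paper: the dimensions `m = 5`, `k = 2`, the random basis, and the helper name `project` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 5, 2  # illustrative dimensions (assumed)

# Random orthonormal basis of R^m: the first k columns (U) span the linear
# subspace parallel to S, the remaining m - k columns (V) span its complement.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
U, V = Q[:, :k], Q[:, k:]
R = U @ U.T  # projection matrix, R = R' = R^2

# theta is the projection of the origin into S, so it must satisfy R theta = 0.
theta = (np.eye(m) - R) @ rng.standard_normal(m)

def project(x):
    """Projection P_S(x) = R x + theta of x onto the affine subspace S."""
    return R @ x + theta

x = rng.standard_normal(m)
p = project(x)
coords = U.T @ x                  # isometric coordinates of P_S(x) in R^k
resid = x - p                     # residual R_S(x)
resid_coords = V.T @ (x - theta)  # coordinates of the residual in R^(m-k)
```

Here `p` equals `U @ coords + theta` and `resid` equals `V @ resid_coords`, matching the two coordinate representations on the slide.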

Joint Density Model

■ Let $X$ denote the predictor in $\mathbb{R}^m$ and $Y$ a categorical
response taking values in $\mathcal{Y} = \{1, \ldots, c\}$.

■ We estimate the conditional class probabilities by
modeling the joint distribution of $(X, Y)$ s.t. $Y$ depends on $X$ only
through its projection onto $S$.

■ $(P_S(X), Y)$ has a nonparametric kernel mixture density on
$S \times \mathcal{Y}$ while independently $R_S(X)$ follows a mean zero
parametric model on $S^\perp$.

■ Say
$$(U'X, Y) \sim \int_{\mathbb{R}^k \times S_c} N_k(x; \mu, \Sigma_1)\, M_c(y; \nu)\, P(d\mu\, d\nu),$$
where $N_k$ denotes the $k$-variate Normal kernel,
$M_c(y; \nu) = \prod_{l=1}^{c} \nu_l^{I(y = l)}$ is the multinomial kernel and
$S_c = \{\nu \in [0,1]^c : \sum_{l} \nu_l = 1\}$.

■ Independently, $V'(X - \theta) \sim N_{m-k}(0, \Sigma_2)$.

■ Then
$$(X, Y) \sim \int_{\mathbb{R}^k \times S_c} N_m(x; U\mu + \theta, \Sigma)\, M_c(y; \nu)\, P(d\mu\, d\nu),$$
where $\Sigma = U\Sigma_1 U' + V\Sigma_2 V'$.

■ Wlog we can take $\Sigma_1$ and $\Sigma_2$ to be diagonal.

■ For sparsity assume $\Sigma_2 = \sigma_0^2 I_{m-k}$, i.e. the $X$ residuals are
homogeneously distributed.

■ Let $\Sigma_1 = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_k^2)$.

■ Then $\Sigma = U(\Sigma_1 - \sigma_0^2 I_k)U' + \sigma_0^2 I_m$ and the model
parameters are
$k$, $U \in V_{k,m}$, $\theta \in \mathbb{R}^m$ satisfying $U'\theta = 0$,
$\sigma = (\sigma_0, \sigma_1, \ldots, \sigma_k)$, a positive vector, and
$P$, a probability on $\mathbb{R}^k \times S_c$.

■ For Bayesian nonparametric inference, set priors on the parameters
s.t. the induced prior on the joint density has full support
and the posterior estimate is consistent.
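
The covariance structure above can be verified numerically. This is a minimal NumPy sketch with made-up values for `m`, `k`, `sigma0` and `sigma`; it checks that the two expressions for $\Sigma$ agree, and that the inverse of $\Sigma$ has the same low-rank-plus-identity form, so that only $k + 1$ scalar reciprocals are needed.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 6, 2  # illustrative dimensions (assumed)

# Orthonormal U (m x k) and its orthogonal complement V (m x (m - k)).
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
U, V = Q[:, :k], Q[:, k:]

sigma0 = 0.5                   # residual standard deviation (made-up value)
sigma = np.array([2.0, 1.5])   # sigma_1, ..., sigma_k (made-up values)
Sigma1 = np.diag(sigma**2)

# Sigma = U Sigma_1 U' + V Sigma_2 V' with Sigma_2 = sigma_0^2 I_{m-k}
Sigma_full = U @ Sigma1 @ U.T + sigma0**2 * (V @ V.T)

# Equivalent form: U (Sigma_1 - sigma_0^2 I_k) U' + sigma_0^2 I_m
Sigma_lowrank = U @ (Sigma1 - sigma0**2 * np.eye(k)) @ U.T + sigma0**2 * np.eye(m)

# The inverse follows the same pattern with reciprocal variances.
Sigma_inv = U @ np.diag(1.0 / sigma**2 - 1.0 / sigma0**2) @ U.T + np.eye(m) / sigma0**2
```

The closed-form inverse relies only on $U$ having orthonormal columns, since $VV' = I_m - UU'$.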

Prior Choice on $\Theta$

■ A common prior choice on $\Theta = (k, U, \theta, \sigma, P)$ that preserves
conjugacy can be

■ a discrete prior on $k$ and, given $k$,

■ a matrix Bingham-von Mises-Fisher density on $U$, which
has the form proportional to $\exp \mathrm{Tr}(UA + UBU'C)$,

■ an $m$-variate Normal on $\theta$ restricted to the space of vectors
orthogonal to $U$,

■ inverse-Gamma priors on the elements of $\sigma$, and

■ a Dirichlet process (DP) prior on $P$: $P \sim DP(w_0 (P_0 \otimes Q_0))$,
where $P_0$ is a $k$-variate Normal and $Q_0$ a Dirichlet
distribution on $S_c$.

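■ When $P$ is discrete, say $P = \sum_{j=1}^{\infty} w_j \delta_{(\mu_j, \nu_j)}$, then
$$P(Y = y \mid X = x; \Theta) = \sum_{j=1}^{\infty} \tilde{w}_j(U'x)\, M_c(y; \nu_j),$$
where
$$\tilde{w}_j(x) = \frac{w_j N_k(x; \mu_j, \Sigma_1)}{\sum_{i=1}^{\infty} w_i N_k(x; \mu_i, \Sigma_1)}, \quad x \in \mathbb{R}^k.$$

■ Markov chain Monte Carlo (MCMC) methods can be
employed to draw from the posterior.

■ The choice of an orthonormal (o.n.) basis leads to rapid convergence
and avoids large dimensional matrix inversion.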
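
The conditional class probability formula for a discrete $P$ can be sketched in NumPy as follows. This is an illustration only: the infinite sum is truncated to a finite number of atoms, the function name `class_probs` and all numeric values are assumptions, and $\Sigma_1$ is taken to be diagonal as on the earlier slide.

```python
import numpy as np

def class_probs(x, U, mu, nu, w, sigma1_sq):
    """P(Y = . | X = x) for a finite (truncated) discrete mixing measure P.

    x: (m,) predictor; U: (m, k) orthonormal basis
    mu: (J, k) atom locations; nu: (J, c) atom class probabilities
    w: (J,) atom weights; sigma1_sq: (k,) diagonal of Sigma_1
    """
    z = U.T @ x  # coordinates of the projection of x
    # log N_k(z; mu_j, Sigma_1) up to a constant shared by all atoms;
    # the shared constant cancels when the weights are normalized.
    log_kern = -0.5 * np.sum((z - mu) ** 2 / sigma1_sq, axis=1)
    log_w = np.log(w) + log_kern
    w_tilde = np.exp(log_w - log_w.max())  # stabilized before normalizing
    w_tilde /= w_tilde.sum()
    return w_tilde @ nu  # sum_j w_tilde_j * nu_j

# Toy usage with two atoms and two classes (all values made up).
m, k = 4, 2
U = np.eye(m)[:, :k]
mu = np.array([[0.0, 0.0], [10.0, 10.0]])
nu = np.array([[0.9, 0.1], [0.2, 0.8]])
w = np.array([0.5, 0.5])
x = np.array([0.0, 0.0, 3.0, -1.0])  # projects exactly onto the first atom
probs = class_probs(x, U, mu, nu, w, sigma1_sq=np.ones(k))
```

Because `x` projects onto the first atom and the second atom is far away, `probs` is essentially the first row of `nu`; the last two coordinates of `x` have no effect, reflecting that $Y$ depends on $X$ only through $U'X$.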

Consistency of the Conditional Class Probabilities

■ Goal: show that the conditional density of $Y$ given $X$ under the
posterior is consistent.

■ Assume the following on $f_t$, the true joint density of $(X, Y)$:

1. $0 < f_t(x, y) < A$ for some constant $A$, for all $(x, y) \in \mathbb{R}^m \times \mathcal{Y}$.

2. $E_t \left| \log\{f_t(X, Y)\} \right| < \infty$.

3. For some $\delta > 0$, $E_t \log \frac{f_t(X, Y)}{f_\delta(X, Y)} < \infty$, where
$f_\delta(x, y) = \inf_{\tilde{x} : \|\tilde{x} - x\| < \delta} f_t(\tilde{x}, y)$.

4. For some $\alpha > 0$, $E_t \|X\|^{2(1+\alpha)m} < \infty$.

Here $E_t$ denotes expectation under $f_t$.

■ Define the probability $P_t$ on $\mathbb{R}^m \times S_c$ as
$$P_t(d\mu\, d\nu) = \sum_{j=1}^{c} f_t(\mu, j)\, d\mu\, \delta_{e_j}(d\nu),$$
where $e_j$ is the vector with 1 as its $j$-th coordinate and zeros
elsewhere.

■ Set priors on the parameters such that, given $k$, $(U, \theta)$, $\sigma$
and $P$ are conditionally independent.

■ Let $(\mathbf{X}_n, \mathbf{Y}_n) = (X_1, Y_1), \ldots, (X_n, Y_n)$ be iid $f_t$.

Weak Posterior Consistency (WPC)

Theorem (Weak Posterior Consistency (WPC))

Let $\Pr(k = m) > 0$ and let the conditional priors on $\sigma$ and $P$ given
$k = m$ contain $0$ and $P_t$ in their weak supports respectively.
Then under assumptions 1-4 on $f_t$, the Kullback-Leibler (KL)
condition is satisfied by the induced prior on $f$ at $f_t$.

The proof runs along the same lines as the proof of Theorem 3.1,
Bhattacharya, Page & Dunson 2012.

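This in turn implies a.s. WPC, which implies that $\forall \epsilon > 0$,
$$\Pi_n\left\{ \left| P(Y = y \mid X \in U; \Theta) - P_t(Y = y \mid X \in U) \right| > \epsilon \right\} \to 0 \quad \text{a.s. } P_t,$$
where $\Pi_n$ denotes the posterior of $\Theta$ given $(\mathbf{X}_n, \mathbf{Y}_n)$.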

Strong Posterior Consistency (SPC)

Theorem (Strong Posterior Consistency (SPC))

Assume the conditions for WPC hold. Pick positive constants
$a$, $b$, $\{\tau_k\}_{k=1}^{m}$ and $A$ and set the prior s.t. for $k \le m - 1$, $\|\theta\|^a$
follows a Gamma density, $\max(\sigma) < A^{1/b}$, and
$\Pr(\min(\sigma) < n^{-1/b} \mid k)$ decays exponentially with $n$. This holds
e.g. with the $\sigma_j$'s all equal and $\sigma_j^{-b}$ following a Gamma density
truncated to $[A^{-1}, \infty)$. For the $DP(w_k (P_k \otimes Q_0))$ prior on $P$,
$k \ge 1$, choose $P_k$ to be a Normal density on $\mathbb{R}^k$ with variance
$\tau_k^2 I_k$. Then a.s. SPC results if the constants satisfy $\tau_k^2 > 4A^2$,
$a < 2(1 + \alpha)m$ and $1/a + 1/b < 1/m$.

The proof follows from the proof of Theorem 3.5, Bhattacharya,
Page & Dunson 2012.

SPC implies
$$\Pi_n\left\{ \int \left| P(Y = y \mid X = x; \Theta) - P_t(Y = y \mid X = x) \right| g_t(x)\, dx > \epsilon \right\} \to 0 \quad \text{a.s. } P_t \;\; \forall y,$$
with $g_t$ the density of $X$ under $P_t$.

■ An inverse-Gamma prior on $\sigma$ satisfies the requirements for
weak but not strong posterior consistency.

■ In Bhattacharya & Dunson 2011, a Gamma prior is proved
eligible when $k = m$, as long as the hyperparameters are
allowed to depend on the sample size $n$ in a suitable way.

■ However, it is assumed there that $f_t$ has compact support.

■ The result is expected to hold true in this context too.

Principal Subspace Classifier (PSC)

■ The marginal density of $X$ is
$$X \sim g(x; \Theta) = \int_{\mathbb{R}^k} N_m(x; \phi(\mu), \Sigma)\, P_1(d\mu),$$
$\phi(\mu) = U\mu + \theta$, $\Sigma = U\Sigma_1 U' + V\Sigma_2 V'$,
$P_1$ is the $\mu$ marginal of $P$.

■ The $X$ component on which $Y$ depends is the $k$-principal
component of $X$ if the eigenvalues of $\Sigma_1$ are greater than or
equal to those of $\Sigma_2$ (and $P$ is non-degenerate).

■ This holds if $\Sigma = \sigma_0^2 I_m$.



■ In some sense the model can be considered a Bayesian
nonparametric extension of the probabilistic PCA of
Tipping & Bishop 1999 and Nyamundanda et al. 2010.

■ The model could also be thought of as a nonparametric
extension of the Bayesian Gaussian process latent variable
models of Titsias & Lawrence 2010 and the SVD models of
Hoff 2007.
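
The principal-component claim for the PSC can be illustrated at the population level. In the NumPy sketch below, the covariance of $\mu$ under $P_1$ (`cov_mu`) and all variances are made-up values; under the model, $\mathrm{Cov}(X) = U(\mathrm{Cov}(\mu) + \Sigma_1)U' + \sigma_0^2 VV'$, and the leading $k$ eigenvectors of this matrix span the same subspace as $U$ when the eigenvalue condition holds.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 6, 2  # illustrative dimensions (assumed)

Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
U, V = Q[:, :k], Q[:, k:]

Sigma1 = np.diag([1.0, 0.8])                  # made-up within-subspace variances
sigma0_sq = 0.5                               # made-up residual variance
cov_mu = np.array([[2.0, 0.3], [0.3, 1.0]])   # made-up covariance of mu under P_1

# Population covariance of X implied by the model.
cov_x = U @ (cov_mu + Sigma1) @ U.T + sigma0_sq * (V @ V.T)

# Leading k eigenvectors (np.linalg.eigh returns eigenvalues in ascending order).
evals, evecs = np.linalg.eigh(cov_x)
top = evecs[:, -k:]
R_pca = top @ top.T   # projection onto the k-principal subspace
```

Comparing `R_pca` with `U @ U.T` shows that the projection matrices coincide, even though the individual eigenvectors need not equal the columns of `U`.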

Estimating $S$

■ To obtain a Bayes estimate for the subspace $S$, choose an
appropriate loss function and minimize the Bayes risk w.r.t.
the posterior distribution.

■ $S$ is characterized by its projection matrix $R$ and origin $\theta$,
i.e. the pair $(R, \theta)$.

■ $R \in \mathbb{R}^{m \times m}$, $\theta \in \mathbb{R}^m$ satisfy $R = R' = R^2$ and $R\theta = 0$. We use
$\mathcal{S}_m$ to denote the space of all such pairs.

■ One particular loss function on $\mathcal{S}_m$ is
$$L((R_1, \theta_1), (R_2, \theta_2)) = \|R_1 - R_2\|^2 + \|\theta_1 - \theta_2\|^2, \quad (R_i, \theta_i) \in \mathcal{S}_m,$$
where $\|A\|^2 = \sum_{i,j} a_{ij}^2 = \mathrm{Tr}(AA')$.

■ Then a point estimate for $(R, \theta)$ is the $(R_1, \theta_1)$ minimizing
the posterior expectation of the loss $L$ over $(R_2, \theta_2)$, provided
there is a unique minimizer.

Estimating the Principal Subspace

Theorem (Subspace Estimator)

Let $f(R, \theta) = \int L((R, \theta), (R_2, \theta_2))\, dP_n(R_2, \theta_2)$, $(R, \theta) \in \mathcal{S}_m$.

This function is minimized by $R = \sum_{j=1}^{k} U_j U_j'$ and $\theta = (I - R)\bar{\theta}$,
where $\bar{R}$ and $\bar{\theta}$ are the posterior means of $R_2$ and $\theta_2$
respectively,
$$2\bar{R} - \bar{\theta}\bar{\theta}' = \sum_{j=1}^{m} \lambda_j U_j U_j', \quad \lambda_1 \ge \ldots \ge \lambda_m,$$
is a s.v.d. of $2\bar{R} - \bar{\theta}\bar{\theta}'$, and $k$ minimizes $k - \sum_{j=1}^{k} \lambda_j$. The
minimizer is unique iff there is a unique minimizer $k$ and
$\lambda_k > \lambda_{k+1}$ for that $k$.
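
The subspace estimator in the theorem can be sketched from posterior draws as follows. This is an illustration, not the authors' implementation: the function name `subspace_estimate` is hypothetical, the decomposition is computed with a symmetric eigendecomposition, the search for $k$ is restricted to $1, \ldots, m$, and the "posterior draws" are simulated by perturbing a known subspace.

```python
import numpy as np

def subspace_estimate(R_draws, theta_draws):
    """Bayes estimate of (k, R, theta) under the squared-norm loss.

    R_draws: (n, m, m) posterior draws of projection matrices
    theta_draws: (n, m) posterior draws of origins
    """
    R_bar = R_draws.mean(axis=0)
    theta_bar = theta_draws.mean(axis=0)
    M = 2.0 * R_bar - np.outer(theta_bar, theta_bar)
    lam, vecs = np.linalg.eigh(M)           # ascending eigenvalues
    lam, vecs = lam[::-1], vecs[:, ::-1]    # reorder so lam_1 >= ... >= lam_m
    m = M.shape[0]
    # Objective j - sum_{i <= j} lam_i, evaluated for j = 1, ..., m.
    objective = np.arange(1, m + 1) - np.cumsum(lam)
    k_hat = int(np.argmin(objective)) + 1
    U_hat = vecs[:, :k_hat]
    R_hat = U_hat @ U_hat.T
    theta_hat = (np.eye(m) - R_hat) @ theta_bar
    return k_hat, R_hat, theta_hat

# Simulated draws around a known subspace (stand-in for MCMC output).
rng = np.random.default_rng(3)
m, k, n_draws = 5, 2, 200
U0, _ = np.linalg.qr(rng.standard_normal((m, k)))
R0 = U0 @ U0.T
theta0 = (np.eye(m) - R0) @ rng.standard_normal(m)

R_draws = np.empty((n_draws, m, m))
theta_draws = np.empty((n_draws, m))
for i in range(n_draws):
    Ui, _ = np.linalg.qr(U0 + 0.01 * rng.standard_normal((m, k)))
    R_draws[i] = Ui @ Ui.T
    theta_draws[i] = (np.eye(m) - R_draws[i]) @ (theta0 + 0.01 * rng.standard_normal(m))

k_hat, R_hat, theta_hat = subspace_estimate(R_draws, theta_draws)
```

Since each increment of the objective is $1 - \lambda_j$, the selected $k$ is effectively the number of eigenvalues exceeding 1, which gives a quick sanity check on the output.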

■ The proof follows from Bhattacharya et al. 2012 and
Bhattacharya, A. & Bhattacharya, R. 2012.

■ The relative importance of different features $\{X_1, \ldots, X_m\}$ in
explaining $Y$ can then be judged by the magnitude of the
corresponding diagonal entry of $R$.

■ The magnitudes can also be used to group the features
according to their relative importance.

Identifiability of $S$

■ $X \sim N_m(0, \Sigma) * (P_1 \circ \phi^{-1})$, with "$*$" denoting convolution.

■ The characteristic function of $X$ is
$$\Phi_X(t) = \exp(-\tfrac{1}{2} t'\Sigma t)\, \Phi_{P_1 \circ \phi^{-1}}(t), \quad t \in \mathbb{R}^m.$$

■ If a discrete $P$ is employed, then $\Sigma$ and $P_1 \circ \phi^{-1}$ can be
uniquely determined from the marginal of $X$.

■ $P_1 \circ \phi^{-1}$ is a distribution on $\mathbb{R}^m$ supported on $S = \phi(\mathbb{R}^k)$.




■ Define the affine support of a probability $Q$, $\mathrm{asupp}(Q)$, as
the intersection of all affine subspaces having probability 1. It
contains the support $\mathrm{supp}(Q)$ (but may be larger).

■ To identify $S$ and $k$ we assume that $\mathrm{asupp}(P_1)$ is $\mathbb{R}^k$.

■ Then $\mathrm{asupp}(P_1 \circ \phi^{-1})$ is an affine subspace of $\mathbb{R}^m$ of
dimension equal to that of $\mathrm{asupp}(P_1)$, namely $k$.

■ Since $\mathrm{asupp}(P_1 \circ \phi^{-1})$ is identifiable, this implies that $k$ is
also identifiable as its dimension.

■ Since $S$ contains $\mathrm{asupp}(P_1 \circ \phi^{-1})$ and has dimension equal
to that of $\mathrm{asupp}(P_1 \circ \phi^{-1})$, $S = \mathrm{asupp}(P_1 \circ \phi^{-1})$.

■ Then $R = UU'$ and $\theta$ are identifiable as the projection
matrix and origin of $S$.

Real Data Examples

■ The classifier built (PSC) is applied to real data examples and
its performance is compared with that of other well-known
classification methods.

■ The three competitors considered are k nearest neighbor
(KNN), mixture discriminant analysis (MDA), and support
vector machine (SVM).

■ KNN is algorithm-based and classifies well in a variety of
settings. A range of neighborhood sizes is considered,
with the one producing the best out-of-sample prediction
ultimately used.

■ MDA is a flexible model-based Gaussian mixture classifier
(see Hastie & Tibshirani 1996). The number of
components in the Gaussian mixture is chosen to produce
the best out-of-sample prediction.

■ SVM is a very accurate classifier and is therefore included.

■ Out-of-sample prediction error rates are used to compare PSC
to the three competitors.

Brain Computer Interface (BCI) Data

■ The BCI dataset consists of a single person performing
400 trials, in each of which he imagined movements with
either the left hand or the right hand.

■ For each trial, EEG was recorded from 39 electrodes.

■ An autoregressive model of order 3 was fit to each of the
resulting 39 time series.

■ Each trial is then represented in a feature space of
dimension 117 = 39 x 3.

■ The goal is to classify each trial as a left or right hand
movement using the 117 features.

■ 200 observations are randomly selected to serve as testing
data.

■ Posterior computations are done with the dimension k fixed.
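
The feature construction described for the BCI data (an order-3 autoregressive fit per electrode, giving 39 x 3 = 117 features per trial) can be sketched as below. The slides do not say how the AR model was fit; ordinary least squares is an assumption here, as are the function names `ar_coefficients` and `trial_features`, the series length of 500, and the simulated white-noise signal standing in for EEG.

```python
import numpy as np

def ar_coefficients(series, order=3):
    """Least-squares AR(order) fit: x_t ~ a_1 x_{t-1} + ... + a_order x_{t-order}."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()  # center the series before fitting
    n = len(x)
    # Column i-1 holds the lag-i values aligned with targets x[order:], i = 1..order.
    lagged = np.column_stack([x[order - i : n - i] for i in range(1, order + 1)])
    target = x[order:]
    coef, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return coef

def trial_features(eeg, order=3):
    """Stack the AR coefficients of every channel: (channels, time) -> (channels * order,)."""
    return np.concatenate([ar_coefficients(channel, order) for channel in eeg])

# Simulated stand-in for one trial: 39 channels, 500 time points (length assumed).
rng = np.random.default_rng(4)
eeg = rng.standard_normal((39, 500))
features = trial_features(eeg)
```

Applying `trial_features` to each of the 400 trials would give a 400 x 117 design matrix, of which 200 rows would be held out for testing.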

■ To select k, the out-of-sample prediction error rate and the
area under the receiver operating characteristic (ROC)
curve are employed.

■ Since low out-of-sample prediction error rates and large
areas under the curve are desirable, the value of k, at most 25,
that maximizes the difference between the two is selected.

■ Following this criterion, k = 3 is chosen.

■ PSC produces an out-of-sample prediction error rate of
0.205, compared to 0.51 for KNN, 0.25 for MDA and 0.23
for SVM.
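
The selection rule for k described above (maximize the area under the ROC curve minus the out-of-sample error rate, over k at most 25) amounts to a one-line argmax. In the sketch below, the function name `select_k` is hypothetical and the error rates and AUC values are made-up placeholders, not results from the BCI analysis.

```python
import numpy as np

def select_k(ks, error_rates, aucs, k_max=25):
    """Return the k <= k_max maximizing (AUC - out-of-sample error rate)."""
    ks = np.asarray(ks)
    score = np.asarray(aucs, dtype=float) - np.asarray(error_rates, dtype=float)
    score = np.where(ks <= k_max, score, -np.inf)  # exclude k above the cap
    return int(ks[np.argmax(score)])

# Placeholder values for illustration only (not taken from the talk).
ks = [1, 2, 3, 30]
error_rates = [0.30, 0.25, 0.20, 0.10]
aucs = [0.70, 0.80, 0.85, 0.99]
best_k = select_k(ks, error_rates, aucs)
```

With these placeholder inputs the candidate k = 30 has the highest score but is excluded by the cap, so the rule returns 3.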

Wisconsin Breast Cancer (WBC) data set

■ In this data set the response is breast cancer diagnosis,
while the covariates consist of 9 nominal variables, each
describing some breast tissue cell characteristic.

■ Although this data set is not high dimensional, it provides a
nice illustration of the type of information the PSC can
provide regarding associations between covariates and
response.

■ Similar to what was done with the BCI data set, k = 3 is
selected.

■ This results in an out-of-sample prediction error rate of
0.017, which is smaller than the error rates for KNN (0.035),
MDA (0.028) and SVM (0.028).



■ Even though the PSC classifies more accurately than the
other methods, what is of particular interest is how each of
the 9 tumor attributes influences classification.

■ The 9 attributes (clump thickness, uniformity of cell size,
uniformity of cell shape, marginal adhesion, single
epithelial cell size, bare nuclei, bland chromatin, normal
nucleoli, and mitosis) are all related to a lump being benign
or not.

■ From the theorem on subspace estimation, the estimated
principal directions are found in the Table below.

Table: The k = 3 principal directions of the Breast Cancer data set
along with the row norms

Variable                       U[,1]     U[,2]     U[,3]     norm
clump thickness               -0.294     0.233     0.453     0.588
uniformity of cell size       -0.399    -0.132    -0.189     0.460
uniformity of cell shape      -0.395    -0.102     0.0172    0.408
marginal adhesion             -0.314    -0.007    -0.477     0.571
single epithelial cell size   -0.231    -0.181    -0.307     0.424
bare nuclei                   -0.450     0.713     0.101     0.849
bland chromatin               -0.295    -0.032    -0.194     0.354
normal nucleoli               -0.376    -0.587     0.543     0.883
mitosis                       -0.121    -0.173    -0.305     0.371

■ A way to assess the relative importance of each variable,
and also to group the variables, is to calculate the norm of
each row of U (i.e. the square root of the corresponding
diagonal entry of R = UU').

■ These values can be found under the header "norm" in the
Table.

■ It appears that bare nuclei and normal nucleoli form a
group.

■ Another is formed by clump thickness and marginal
adhesion.

■ Finally, it appears that uniformity of cell size, uniformity of
cell shape and single epithelial cell size form a group.
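
The row norms reported in the Table can be reproduced from the three principal direction columns. The NumPy sketch below copies the values from the Table; the variable names are illustrative, and the third column is read as U[,3].

```python
import numpy as np

variables = [
    "clump thickness", "uniformity of cell size", "uniformity of cell shape",
    "marginal adhesion", "single epithelial cell size", "bare nuclei",
    "bland chromatin", "normal nucleoli", "mitosis",
]

# Columns U[,1], U[,2], U[,3] as reported in the Table.
U = np.array([
    [-0.294,  0.233,  0.453],
    [-0.399, -0.132, -0.189],
    [-0.395, -0.102,  0.0172],
    [-0.314, -0.007, -0.477],
    [-0.231, -0.181, -0.307],
    [-0.450,  0.713,  0.101],
    [-0.295, -0.032, -0.194],
    [-0.376, -0.587,  0.543],
    [-0.121, -0.173, -0.305],
])
table_norms = np.array([0.588, 0.460, 0.408, 0.571, 0.424, 0.849, 0.354, 0.883, 0.371])

R = U @ U.T
row_norms = np.sqrt(np.diag(R))   # same as np.linalg.norm(U, axis=1)

# Variables ordered from most to least important by row norm.
ranking = [variables[i] for i in np.argsort(-row_norms)]
```

The computed `row_norms` agree with the "norm" column up to rounding, and `ranking` places normal nucleoli and bare nuclei first, followed by clump thickness and marginal adhesion, in line with the groupings noted above.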

Summary

■ A flexible nonparametric model is proposed for classification
via feature space dimension reduction.

■ The model satisfies large support & consistency
conditions.

■ A simple Gibbs sampler can be implemented with
conjugate sampling steps for posterior sampling.

■ In the examples shown, it performs better than commonly used
machine learning, computer science and parametric statistical
methods.

■ These methods are algorithmic or highly parameterized
black boxes and, apart from classification, provide no
further information specific to the problem being studied.

■ In addition to building efficient classifiers, the proposed
methodology provides insight regarding predictors that are
influential in explaining the response, information that
applied scientists often value highly.

■ It can easily be extended to other regression setups.

Further Work possible

■ Change the joint kernel choice to build a better classifier.

■ Change the notion of inner product to use nonlinear
predictor transformations to explain the response.

■ A nonparametric model may be fit on the non-signal
predictors as well.

■ Use other priors besides the Dirichlet process.

■ Extend to nonparametric hypothesis testing along the lines of
Bhattacharya & Dunson 2012.